Abstract

Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization 
methods. While it has already been theoretically studied for decades, the classical analysis usually 
required non-trivial smoothness assumptions, which do not apply to many modern applications of SGD 
with non-smooth objective functions such as support vector machines. In this paper, we investigate 
the performance of SGD without such smoothness assumptions, as well as a running average scheme to 
convert the SGD iterates to a solution with optimal optimization accuracy. In this framework, we prove 
that after $T$ rounds, the suboptimality of the last SGD iterate scales as $O(\log(T)/\sqrt{T})$ for non-smooth
convex objective functions, and $O(\log(T)/T)$ in the non-smooth strongly convex case. To the best of
our knowledge, these are the first bounds of this kind, and almost match the minimax-optimal rates 
obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, 
which not only attains optimal rates, but can also be easily computed on-the-fly (in contrast, the suffix 
averaging scheme proposed in Rakhlin et al. (2011) is not as simple to implement). Finally, we provide
some experimental illustrations. 



1 Introduction 

This paper considers one of the simplest and most popular stochastic optimization algorithms, namely 
Stochastic Gradient Descent (SGD). SGD can be used to optimize any convex function F over a convex 
domain W, given access only to unbiased estimates of F's gradients (or more generally, subgradients¹). This
feature makes it very useful for learning problems, where our goal is to minimize generalization error based 
only on a finite sampled training set. Moreover, SGD is extremely simple and highly scalable, making it 
particularly suitable for large-scale learning problems. 

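The algorithm itself proceeds in rounds, and can be described in just a few lines: We initialize $\mathbf{w}_1 \in \mathcal{W}$
(following common practice, we will assume $\mathbf{w}_1 = 0$). At round $t = 1, 2, \ldots$, we obtain a random estimate
$\hat{\mathbf{g}}_t$ of a subgradient $\mathbf{g}_t \in \partial F(\mathbf{w}_t)$ so that $\mathbb{E}[\hat{\mathbf{g}}_t] = \mathbf{g}_t$, and update the iterate $\mathbf{w}_t$ as follows:

$$\mathbf{w}_{t+1} = \Pi_{\mathcal{W}}(\mathbf{w}_t - \eta_t \hat{\mathbf{g}}_t),$$

where $\eta_t$ is a suitably chosen step-size parameter, and $\Pi_{\mathcal{W}}$ denotes projection on $\mathcal{W}$.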
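
The update rule can be illustrated with a minimal Python sketch. It assumes a one-dimensional domain $\mathcal{W} = [lo, hi]$, so that projection reduces to clipping. The function names, the objective $F(w) = |w - 0.3|$, the Gaussian noise model and the step size $0.5/\sqrt{t}$ are all illustrative assumptions rather than part of the setup analyzed in this paper.

```python
import random

def project_interval(w, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return max(lo, min(hi, w))

def projected_sgd(subgrad_oracle, step_size, T, lo=-1.0, hi=1.0, w1=0.0):
    """Run T rounds of projected SGD and return the iterates w_1, ..., w_{T+1}."""
    iterates = [w1]
    w = w1
    for t in range(1, T + 1):
        g_hat = subgrad_oracle(w)  # noisy, unbiased subgradient estimate at w_t
        w = project_interval(w - step_size(t) * g_hat, lo, hi)
        iterates.append(w)
    return iterates

# Hypothetical non-smooth objective F(w) = |w - 0.3| on W = [-1, 1].
_rng = random.Random(0)

def noisy_abs_oracle(w):
    """Subgradient of |w - 0.3| plus zero-mean Gaussian noise (assumed noise model)."""
    g = 1.0 if w > 0.3 else (-1.0 if w < 0.3 else 0.0)
    return g + _rng.gauss(0.0, 0.1)

iterates = projected_sgd(noisy_abs_oracle, lambda t: 0.5 / t ** 0.5, T=2000)
last_iterate = iterates[-1]
```

The oracle only ever returns a noisy subgradient, never the value of $F$, which matches the first-order stochastic setting considered here.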

In terms of theoretical analysis, SGD has been studied for decades (for instance, see Kushner & Yin (2003)
and references therein), but perhaps surprisingly, there are still important gaps left in our understanding of
this method. First of all, most classical results look at asymptotic convergence rates, which do not apply to
a fixed iteration budget T. In recent years, more attention has been devoted to non-asymptotic bounds (e.g.,
Bach & Moulines (2011)). However, these classical convergence bounds often make non-trivial smoothness
assumptions on the function F, such as Lipschitz-continuity of the gradient or higher-order derivatives. In
modern applications, these assumptions often do not hold. For example, if SGD is used to solve the support
vector machine optimization problem (with the standard non-smooth hinge loss) on a finite training set,
then the underlying objective function F is essentially non-smooth, even at the optimal solution. In general,
for machine learning applications F may be non-smooth whenever one uses a non-smooth loss function, and
thus a smoothness-based analysis is not appropriate.

¹ Following a common convention, we still refer to the algorithm in this case as "gradient descent".

Without assuming smoothness, most of the existing analysis has been carried out in the context of online
learning, a more difficult setting than our stochastic setting, in which the subgradients are assumed to be
provided by an adversary. Using online-to-batch conversion, it is possible to show that after $T$ iterations, the
average of the iterates, $(\mathbf{w}_1 + \ldots + \mathbf{w}_T)/T$, has $O(\log(T)/T)$ optimization error for strongly-convex $F$ (see
precise definition in Sec. 2), and $O(1/\sqrt{T})$ error for general convex $F$ (Zinkevich (2003); Hazan et al. (2007);
Hazan & Kale (2011)). However, Rakhlin et al. (2011) showed that simple averaging is provably suboptimal
in a stochastic setting. Instead, they proposed averaging the last $\alpha T$ iterates of SGD (where $\alpha \in (0, 1)$, e.g.
$1/2$), and showed that this averaging scheme has an optimal $O(1/T)$ convergence rate. In comparison, in the
non-smooth setting, there are $\Omega(1/\sqrt{T})$ and $\Omega(1/T)$ lower bounds for convex and strongly-convex problems,
respectively (Agarwal et al. (2012)).



These results leave open several issues. First, they pertain to averaging significant parts of the iterates,
although in practice averaging just over the last few iterates, or returning the last iterate $\mathbf{w}_T$, often works
quite well (e.g. Shalev-Shwartz et al. (2011)). Unless $F$ is smooth, the previous results cannot say much
about the optimization error of individual iterates. For example, the results in Rakhlin et al. (2011) only
imply an $O(1/\sqrt{T})$ convergence rate for the last iterate $\mathbf{w}_T$ with strongly-convex functions, and we are not
aware of any results for the last iterate $\mathbf{w}_T$ in the general convex case. In fact, to the best of our knowledge,
even for the simpler (non-stochastic) gradient descent method (where $\hat{\mathbf{g}}_t = \mathbf{g}_t$), we do not know any existing
results that can guarantee the performance of each individual iterate $\mathbf{w}_T$. Second, the theoretically optimal
suffix-averaging scheme proposed in Rakhlin et al. (2011) has some practical limitations, since it cannot be
computed on-the-fly: unless we can store all iterates $\mathbf{w}_1, \ldots, \mathbf{w}_T$ in memory, one needs to know the stopping
time $T$ beforehand, in order to know when to start computing the suffix average. In practice, $T$ is often not
known in advance. This can be partially remedied with a so-called doubling trick, but it is still not a simple
or natural procedure compared to just averaging all iterates, and the latter was shown to be suboptimal in
Rakhlin et al. (2011).



In this paper, we investigate the convergence rate of SGD and the averaging schemes required to obtain 
them, with the following contributions: 

• We prove that the expected optimization error of every individual iterate is $O(\log(T)/T)$ for
strongly-convex $F$, and $O(\log(T)/\sqrt{T})$ for general convex $F$, without smoothness assumptions on $F$.
These results show that the suboptimality of the last iterate is not much worse than the optimal rates
obtainable by averaging schemes, and partially address an open problem posed in Shamir (2012).
Moreover, the latter result is (to the best of our knowledge) the first finite-sample bound on individual
iterates of SGD for non-smooth convex optimization. The proof relies on a technique to reduce results
on averages of iterates to results on individual iterates, which was implicitly used in Zhang (2004) for
a somewhat different setting.

• We improve the existing expected error bound on the suffix averaging scheme of Rakhlin et al. (2011),
from $O((1 + \log(\frac{1}{1-\alpha}))/\alpha T)$ to $O(\log(\frac{1}{\min\{\alpha, 1-\alpha\}})/T)$.

• We propose a new and very simple running average scheme, called polynomial-decay averaging, and
prove that it enjoys optimal rates of convergence. Unlike suffix-averaging, this new running average
scheme can be easily computed on-the-fly.

• We provide a simple experimental study of the averaging schemes discussed in the paper. 

We emphasize that although there exist other algorithms with $O(1/T)$ convergence rate in the strongly
convex case (e.g. Hazan & Kale (2011); Ouyang & Gray (2012)), our focus in this paper is on the simple
and widely-used SGD algorithm.



2 Preliminaries 

We use bold-face letters to denote vectors. We let $F$ denote a convex function over a (closed) convex domain
$\mathcal{W}$, which is a subset of some Hilbert space with an induced norm $\|\cdot\|$. We assume that $F$ is minimized at
some $\mathbf{w}^* \in \mathcal{W}$. Besides general convex $F$, we will also consider the important sub-class of strongly-convex
functions. Formally, we say that a function $F$ is $\lambda$-strongly convex, if for all $\mathbf{w}, \mathbf{w}' \in \mathcal{W}$ and any subgradient
$\mathbf{g}$ of $F$ at $\mathbf{w}$, it holds that

$$F(\mathbf{w}') \geq F(\mathbf{w}) + \langle \mathbf{g}, \mathbf{w}' - \mathbf{w} \rangle + \frac{\lambda}{2} \|\mathbf{w}' - \mathbf{w}\|^2,$$

where $\lambda > 0$. For a general convex function, the above inequality can always be satisfied with $\lambda = 0$.
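
As a concrete check of this definition, the sketch below numerically verifies the inequality for the one-dimensional non-smooth function $F(w) = \frac{\lambda}{2} w^2 + |w|$, which is $\lambda$-strongly convex. The function, the grid of test points and the value of $\lambda$ are illustrative choices, not taken from the paper.

```python
def F(w, lam):
    """Example objective: (lam / 2) * w^2 + |w|, which is non-smooth at w = 0."""
    return 0.5 * lam * w * w + abs(w)

def subgradient(w, lam):
    """One valid subgradient of F at w (at the kink w = 0 we pick 0)."""
    sign = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)
    return lam * w + sign

def strong_convexity_gap(w, w_prime, lam):
    """F(w') minus the strongly-convex lower bound built at w; non-negative if the definition holds."""
    g = subgradient(w, lam)
    lower = F(w, lam) + g * (w_prime - w) + 0.5 * lam * (w_prime - w) ** 2
    return F(w_prime, lam) - lower

lam = 0.5
grid = [i / 4.0 for i in range(-8, 9)]  # test points in [-2, 2]
min_gap = min(strong_convexity_gap(w, wp, lam) for w in grid for wp in grid)
```

The quadratic part accounts exactly for the $\frac{\lambda}{2}\|\mathbf{w}' - \mathbf{w}\|^2$ term, and the absolute value contributes a non-negative remainder, so the smallest gap over the grid is never negative.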

As discussed in the introduction, we consider the first-order stochastic optimization setting, where instead
of having direct access to $F$, we only have access to an oracle, which given some $\mathbf{w} \in \mathcal{W}$, returns a random
vector $\hat{\mathbf{g}}$ such that $\mathbb{E}[\hat{\mathbf{g}}] \in \partial F(\mathbf{w})$. Our goal is to use a bounded number $T$ of oracle calls, and compute
some $\mathbf{w} \in \mathcal{W}$ such that the optimization error, $F(\mathbf{w}) - F(\mathbf{w}^*)$, is as small as possible. It is well-known that
this framework can be applied to learning problems (see for instance Shalev-Shwartz et al. (2009)): given a
hypothesis class $\mathcal{W}$ and a set of $T$ i.i.d. examples, we wish to find a predictor $\mathbf{w}$ whose expected loss $F(\mathbf{w})$
is close to optimal over $\mathcal{W}$. Since the examples are chosen i.i.d., the subgradient of the loss function with
respect to any individual example can be shown to be an unbiased estimate of a subgradient of $F$. We will
mostly consider bounds on the expected error (over the oracle's and algorithm's randomness) for simplicity,
although it is possible to obtain high-probability bounds in some cases.

In terms of the step-size $\eta_t$ in the strongly-convex case, we will generally assume it equals $1/(\lambda t)$. We note
that this is without much loss of generality, since if the step size is $c/\lambda t$ for some $c \geq 1$, then it is equivalent
to taking step sizes $1/(\lambda' t)$ where $\lambda' := \lambda/c \leq \lambda$ is a lower-bound on the strong convexity parameter. Since
any $\lambda$-strongly convex function is also $\lambda'$-strongly convex, we can use the analysis here to get upper bounds
in terms of $\lambda'$, and if so desired, substitute $\lambda/c$ instead of $\lambda'$ in the final bound.

When we run SGD, we let $\hat{\mathbf{g}}_t$ denote the random vector obtained at round $t$ (when we query at $\mathbf{w}_t$), and
let $\mathbf{g}_t = \mathbb{E}[\hat{\mathbf{g}}_t]$ denote the underlying subgradient of $F$. To facilitate our convergence bounds, we assume that
$\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq G^2$ for some fixed $G$. Also, when optimizing general convex functions, we will assume that the
diameter of $\mathcal{W}$, namely $\sup_{\mathbf{w}, \mathbf{w}' \in \mathcal{W}} \|\mathbf{w} - \mathbf{w}'\|$, is bounded by some constant $D$.



3 Convergence of Individual SGD Iterates 

We begin by considering the case of strongly convex $F$, and prove the following bound on the expected error
of any individual iterate $\mathbf{w}_T$. In this theorem as well as later ones, we did not attempt to optimize constants.

Theorem 1. Suppose $F$ is $\lambda$-strongly convex, and that $\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq G^2$ for all $t$. Consider SGD with step
sizes $\eta_t = 1/\lambda t$. Then for any $T > 1$, it holds that

$$\mathbb{E}[F(\mathbf{w}_T) - F(\mathbf{w}^*)] \leq \frac{17 G^2 (1 + \log(T))}{\lambda T}.$$

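Proof. The beginning of the proof is standard. By convexity of $\mathcal{W}$, we have the following for any $\mathbf{w} \in \mathcal{W}$:

$$\mathbb{E}[\|\mathbf{w}_{t+1} - \mathbf{w}\|^2] = \mathbb{E}[\|\Pi_{\mathcal{W}}(\mathbf{w}_t - \eta_t \hat{\mathbf{g}}_t) - \mathbf{w}\|^2]$$
$$\leq \mathbb{E}[\|\mathbf{w}_t - \eta_t \hat{\mathbf{g}}_t - \mathbf{w}\|^2]$$
$$\leq \mathbb{E}[\|\mathbf{w}_t - \mathbf{w}\|^2] - 2\eta_t \mathbb{E}[\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w} \rangle] + \eta_t^2 G^2.$$

Let $k$ be an arbitrary element in $\{1, \ldots, \lfloor T/2 \rfloor\}$. Extracting the inner product, summing over all $t = T-k, \ldots, T$, and rearranging, we get

$$\sum_{t=T-k}^{T} \mathbb{E}[\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w} \rangle] \leq \frac{\mathbb{E}[\|\mathbf{w}_{T-k} - \mathbf{w}\|^2]}{2\eta_{T-k}} + \sum_{t=T-k+1}^{T} \frac{\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}\|^2]}{2} \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) + \frac{G^2}{2} \sum_{t=T-k}^{T} \eta_t. \qquad (1)$$

By convexity of $F$, we can lower bound $\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w} \rangle$ by $F(\mathbf{w}_t) - F(\mathbf{w})$. Plugging this in and substituting
$\eta_t = 1/\lambda t$, we get

$$\mathbb{E}\left[\sum_{t=T-k}^{T} (F(\mathbf{w}_t) - F(\mathbf{w}))\right] \leq \frac{\lambda (T-k)}{2} \mathbb{E}[\|\mathbf{w}_{T-k} - \mathbf{w}\|^2] + \frac{\lambda}{2} \sum_{t=T-k+1}^{T} \mathbb{E}[\|\mathbf{w}_t - \mathbf{w}\|^2] + \frac{G^2}{2\lambda} \sum_{t=T-k}^{T} \frac{1}{t}. \qquad (2)$$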



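Now comes the crucial trick: instead of picking $\mathbf{w} = \mathbf{w}^*$, as done in standard analysis (Hazan et al. (2007);
Rakhlin et al. (2011)), we instead pick $\mathbf{w} = \mathbf{w}_{T-k}$. We also use the fact that $\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2] \leq \frac{4G^2}{\lambda^2 t}$ (Rakhlin
et al. (2011), Lemma 1), which implies that for any $t \geq T-k$,

$$\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}_{T-k}\|^2] \leq 2\,\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2 + \|\mathbf{w}_{T-k} - \mathbf{w}^*\|^2] \leq \frac{8G^2}{\lambda^2} \left( \frac{1}{t} + \frac{1}{T-k} \right) \leq \frac{16G^2}{\lambda^2 (T-k)} \leq \frac{32G^2}{\lambda^2 T}.$$

Plugging this back into Eq. (2), we get

$$\mathbb{E}\left[\sum_{t=T-k}^{T} (F(\mathbf{w}_t) - F(\mathbf{w}_{T-k}))\right] \leq \frac{16G^2 k}{\lambda T} + \frac{G^2}{2\lambda} \sum_{t=T-k}^{T} \frac{1}{t}.$$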



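Let $S_k = \frac{1}{k+1} \sum_{t=T-k}^{T} \mathbb{E}[F(\mathbf{w}_t)]$ be the expected average value of the last $k+1$ iterates. The bound above
implies that

$$-\mathbb{E}[F(\mathbf{w}_{T-k})] \leq -\mathbb{E}[S_k] + \frac{G^2}{2\lambda} \left( \frac{32}{T} + \frac{1}{k+1} \sum_{t=T-k}^{T} \frac{1}{t} \right).$$

By the definition of $S_k$ and the inequality above, we have

$$k\,\mathbb{E}[S_{k-1}] = (k+1)\,\mathbb{E}[S_k] - \mathbb{E}[F(\mathbf{w}_{T-k})] \leq (k+1)\,\mathbb{E}[S_k] - \mathbb{E}[S_k] + \frac{G^2}{2\lambda} \left( \frac{32}{T} + \sum_{t=T-k}^{T} \frac{1}{(k+1)t} \right),$$

and dividing by $k$, implies

$$\mathbb{E}[S_{k-1}] \leq \mathbb{E}[S_k] + \frac{G^2}{2\lambda} \left( \frac{32}{kT} + \sum_{t=T-k}^{T} \frac{1}{k(k+1)t} \right). \qquad (3)$$

Using this inequality repeatedly and summing from $k = 1$ to $k = \lfloor T/2 \rfloor$, we have

$$\mathbb{E}[F(\mathbf{w}_T)] = \mathbb{E}[S_0] \leq \mathbb{E}[S_{\lfloor T/2 \rfloor}] + \frac{16G^2}{\lambda T} \sum_{k=1}^{\lfloor T/2 \rfloor} \frac{1}{k} + \frac{G^2}{2\lambda} \sum_{k=1}^{\lfloor T/2 \rfloor} \sum_{t=T-k}^{T} \frac{1}{k(k+1)t}. \qquad (4)$$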



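It now just remains to bound these terms. $\mathbb{E}[S_{\lfloor T/2 \rfloor}]$ is the expected average value of the last $\lfloor T/2 \rfloor$ iterates,
which was already analyzed in (Rakhlin et al. (2011), Theorem 5), yielding a bound of

$$\mathbb{E}[S_{\lfloor T/2 \rfloor}] \leq F(\mathbf{w}^*) + \frac{10G^2}{\lambda T}$$

for $T > 1$. Moreover, we have $\sum_{k=1}^{\lfloor T/2 \rfloor} (1/k) \leq 1 + \log(T/2)$. Finally, we have

$$\sum_{k=1}^{\lfloor T/2 \rfloor} \sum_{t=T-k}^{T} \frac{1}{k(k+1)t} \leq \sum_{k=1}^{\lfloor T/2 \rfloor} \frac{1}{k(T-k)} = \frac{1}{T} \sum_{k=1}^{\lfloor T/2 \rfloor} \left( \frac{1}{k} + \frac{1}{T-k} \right) \leq (1 + \log(T))/T.$$

The result follows by substituting the above bounds into Eq. (4) and simplifying for readability. □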
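
The final bound on the double sum in the proof above is easy to sanity-check numerically. The following sketch evaluates the double sum directly and compares it with $(1 + \log(T))/T$; the helper names and the tested values of $T$ are illustrative.

```python
import math

def double_sum(T):
    """Sum over k = 1..floor(T/2) and t = T-k..T of 1 / (k * (k + 1) * t)."""
    total = 0.0
    for k in range(1, T // 2 + 1):
        inner = sum(1.0 / t for t in range(T - k, T + 1))
        total += inner / (k * (k + 1))
    return total

def log_bound(T):
    """Right-hand side (1 + log(T)) / T used in the proof."""
    return (1.0 + math.log(T)) / T

# Each entry is (T, value of the double sum, value of the bound).
checks = [(T, double_sum(T), log_bound(T)) for T in (2, 10, 100, 1000)]
```

Both sides decay roughly like $\log(T)/T$, which is where the logarithmic factor in Theorem 1 comes from.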

Using a similar technique, we can also get an individual iterate bound, in the case of a general convex
function $F$ that may be non-smooth. We note that a similar technique was used in Zhang (2004), but for a
different algorithm (one with constant learning rate), and the result was less explicit.

Theorem 2. Suppose that $F$ is convex, and that for some constants $D, G$, it holds that $\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq G^2$ for all
$t$, and $\sup_{\mathbf{w}, \mathbf{w}' \in \mathcal{W}} \|\mathbf{w} - \mathbf{w}'\| \leq D$. Consider SGD with step sizes $\eta_t = c/\sqrt{t}$ where $c > 0$ is a constant. Then
for any $T > 1$, it holds that

$$\mathbb{E}[F(\mathbf{w}_T) - F(\mathbf{w}^*)] \leq \left( \frac{D^2}{c} + cG^2 \right) \frac{2 + \log(T)}{\sqrt{T}}.$$



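Proof. The proof begins the same as in Thm. 1 (this time letting $k$ be an element in $\{1, \ldots, T-1\}$), up to
Eq. (1). Instead of substituting $\eta_t = 1/\lambda t$, we substitute $\eta_t = c/\sqrt{t}$, upper bound $\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}\|^2]$ by $D^2$, pick
$\mathbf{w} = \mathbf{w}_{T-k}$ and slightly simplify to get

$$\mathbb{E}\left[\sum_{t=T-k}^{T} \langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}_{T-k} \rangle\right] \leq \frac{D^2}{2c} \left( \sqrt{T} - \sqrt{T-k} \right) + \frac{cG^2}{2} \sum_{t=T-k}^{T} \frac{1}{\sqrt{t}}.$$

By convexity, we can lower bound $\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}_{T-k} \rangle$ by $F(\mathbf{w}_t) - F(\mathbf{w}_{T-k})$. Also, it is easy to verify (e.g. by
integration) that $\sum_{t=T-k}^{T} \frac{1}{\sqrt{t}} \leq 2(\sqrt{T} - \sqrt{T-k-1})$, hence

$$\mathbb{E}\left[\sum_{t=T-k}^{T} (F(\mathbf{w}_t) - F(\mathbf{w}_{T-k}))\right] \leq \left( \frac{D^2}{2c} + cG^2 \right) \left( \sqrt{T} - \sqrt{T-k-1} \right) = \left( \frac{D^2}{2c} + cG^2 \right) \frac{k+1}{\sqrt{T} + \sqrt{T-k-1}} \leq \left( \frac{D^2}{2c} + cG^2 \right) \frac{k+1}{\sqrt{T}}. \qquad (5)$$

As in the proof of Thm. 1, let $S_k = \frac{1}{k+1} \sum_{t=T-k}^{T} \mathbb{E}[F(\mathbf{w}_t)]$ be the expected average value of the last $k+1$
iterates. The bound above implies that

$$-\mathbb{E}[F(\mathbf{w}_{T-k})] \leq -\mathbb{E}[S_k] + \frac{D^2/2c + cG^2}{\sqrt{T}}.$$

By the definition of $S_k$ and the inequality above, we have

$$k\,\mathbb{E}[S_{k-1}] = (k+1)\,\mathbb{E}[S_k] - \mathbb{E}[F(\mathbf{w}_{T-k})] \leq (k+1)\,\mathbb{E}[S_k] - \mathbb{E}[S_k] + \frac{D^2/2c + cG^2}{\sqrt{T}},$$

and dividing by $k$, implies

$$\mathbb{E}[S_{k-1}] \leq \mathbb{E}[S_k] + \frac{D^2/2c + cG^2}{k\sqrt{T}}.$$

Using this inequality repeatedly and by summing over $k = 1, \ldots, T-1$, we have

$$\mathbb{E}[F(\mathbf{w}_T)] = \mathbb{E}[S_0] \leq \mathbb{E}[S_{T-1}] + \frac{D^2/2c + cG^2}{\sqrt{T}} \sum_{k=1}^{T-1} \frac{1}{k}. \qquad (6)$$

It now just remains to bound the terms on the right hand side. Using Eq. (1) with $k = T-1$ and $\mathbf{w} = \mathbf{w}^*$,
and upper bounding the norms by $D$, it is easy to calculate that

$$\mathbb{E}[S_{T-1}] - F(\mathbf{w}^*) = \frac{1}{T}\,\mathbb{E}\left[\sum_{t=1}^{T} (F(\mathbf{w}_t) - F(\mathbf{w}^*))\right] \leq \frac{D^2/2c + cG^2}{\sqrt{T}}.$$

Also, we have $\sum_{k=1}^{T-1} 1/k \leq (1 + \log(T))$. Plugging these upper bounds into Eq. (6) and simplifying for
readability, we get the required bound. □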

4 Averaging Schemes 

The bounds shown in the previous section imply that individual iterates $\mathbf{w}_T$ have $O(\log(T)/T)$ expected
error in the strongly convex case, and $O(\log(T)/\sqrt{T})$ expected error in the convex case. These bounds are
close but not the same as the minimax optimal rates, which are $O(1/T)$ and $O(1/\sqrt{T})$ respectively. In this
section, we consider averaging schemes, which rather than return individual iterates, return some weighted
combination of all iterates $\mathbf{w}_1, \ldots, \mathbf{w}_T$, attaining the minimax optimal rates. We mainly focus here on the
strongly-convex case, since simple averaging of all iterates is already known to be optimal (up to constants) 
in the general convex case. 

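We first examine the case of $\alpha$-suffix averaging, defined as the average of the last $\alpha T$ iterates (where
$\alpha \in (0, 1)$ is a constant, and $\alpha T$ is assumed to be an integer):

$$\bar{\mathbf{w}}_T^{\alpha} = \frac{1}{\alpha T} \sum_{t=(1-\alpha)T+1}^{T} \mathbf{w}_t.$$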


In Rakhlin et al. (2011), it was shown that this averaging scheme results in an optimization error of
$O((1 + \log(\frac{1}{1-\alpha}))/\alpha T)$, which is optimal in terms of $T$, but increases rapidly as we make $\alpha$ smaller. The following
theorem shows a tighter upper bound of $O(\log(\frac{1}{\min\{\alpha, 1-\alpha\}})/T)$, which implies we can be much more flexible
in choosing $\alpha$. Besides being of independent interest, we will re-use this result in our proofs later on.
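
A minimal sketch of $\alpha$-suffix averaging follows, assuming the iterates are stored as a Python list of equal-length lists of floats (the storage format and the function name are assumptions made for illustration). Note that the function needs all stored iterates, or equivalently advance knowledge of $T$, which is the practical limitation of this scheme.

```python
def suffix_average(iterates, alpha):
    """Average the last alpha * T vectors in `iterates`, where T = len(iterates)."""
    T = len(iterates)
    n = round(alpha * T)
    if n < 1 or n > T or abs(alpha * T - n) > 1e-9:
        raise ValueError("alpha * T must be an integer in {1, ..., T}")
    tail = iterates[T - n:]  # w_{(1 - alpha) T + 1}, ..., w_T
    dim = len(tail[0])
    return [sum(w[i] for w in tail) / n for i in range(dim)]

iterates = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0], [4.0, 6.0]]
half_suffix = suffix_average(iterates, 0.5)  # averages the last two iterates
```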
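Theorem 3. Under the conditions of Thm. 1, and assuming $\alpha T$ is an integer, it holds that $\mathbb{E}[F(\bar{\mathbf{w}}_T^{\alpha}) - F(\mathbf{w}^*)]$
is at most

$$\frac{17G^2 \left( 1 + \log\left( \frac{1}{\min\{\alpha, (1 + 1/T) - \alpha\}} \right) \right)}{\lambda T}.$$

Proof. Suppose first that $\alpha T \leq \lfloor T/2 \rfloor$. The proof is mostly identical to that of Thm. 1, except that instead
of using Eq. (3) to bound $\mathbb{E}[S_0]$, we use it to bound $\mathbb{E}[S_{\alpha T - 1}] = \frac{1}{\alpha T} \sum_{t=(1-\alpha)T+1}^{T} \mathbb{E}[F(\mathbf{w}_t)]$, which by convexity
upper bounds $\mathbb{E}[F(\bar{\mathbf{w}}_T^{\alpha})]$. We get:

$$\mathbb{E}[S_{\alpha T - 1}] \leq \mathbb{E}[S_{\lfloor T/2 \rfloor}] + \frac{16G^2}{\lambda T} \sum_{k=\alpha T}^{\lfloor T/2 \rfloor} \frac{1}{k} + \frac{G^2}{2\lambda} \sum_{k=\alpha T}^{\lfloor T/2 \rfloor} \sum_{t=T-k}^{T} \frac{1}{k(k+1)t}.$$

Using the same argument as in the proof of Thm. 1 and the fact that $\sum_{k=\alpha T}^{\beta T} \frac{1}{k} \leq 1 + \log(\beta/\alpha)$ for any
integers $\alpha T, \beta T$ that are no larger than $T$, we can obtain the upper bounds

$$\mathbb{E}[S_{\lfloor T/2 \rfloor}] \leq F(\mathbf{w}^*) + 10G^2/\lambda T,$$

$$\sum_{k=\alpha T}^{\lfloor T/2 \rfloor} \frac{1}{k} \leq 1 + \log(1/2\alpha),$$

and

$$\sum_{k=\alpha T}^{\lfloor T/2 \rfloor} \sum_{t=T-k}^{T} \frac{1}{k(k+1)t} \leq \frac{1}{T} \sum_{k=\alpha T}^{\lfloor T/2 \rfloor} \left( \frac{1}{k} + \frac{1}{T-k} \right) \leq \frac{1}{T} \left( (1 + \log(1/2\alpha)) + (1 + \log(2(1-\alpha))) \right) \leq \frac{1}{T} (2 + \log(1/\alpha)).$$

Using the above estimates, with some simplifications for readability, we get that $\mathbb{E}[F(\bar{\mathbf{w}}_T^{\alpha}) - F(\mathbf{w}^*)]$ is at
most

$$17 \left( 1 + \log\left( \frac{1}{\alpha} \right) \right) \frac{G^2}{\lambda T}. \qquad (7)$$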



This analysis assumed $\alpha T \leq \lfloor T/2 \rfloor$. If $\alpha$ is larger, we can use the existing analysis (Rakhlin et al. (2011),
Theorem 5), and get that $\mathbb{E}[F(\bar{\mathbf{w}}_T^{\alpha}) - F(\mathbf{w}^*)]$ is at most

Combining Eq. (7) and Eq. (8) with a uniform upper bound which holds for all $\alpha$, we get the required
bound. □

We note that in the general convex case without assuming strong convexity, one can use an analogous
proof to show an upper bound of order $\log(1/\alpha)/\sqrt{T}$ for $\alpha$-suffix averaging. In contrast, existing techniques
only imply a bound of order $1/\sqrt{\alpha T}$.

As discussed in the introduction, a limitation of suffix averaging is that unless we can store all iterates in
memory, it requires us to guess the stopping time $T$ in advance. For example, if we do 1/2-suffix averaging,
we need to "know" when we got to iterate $T/2$ and should start averaging. In practice, the stopping
time $T$ is often not known in advance and is determined empirically (e.g. till satisfactory performance is
obtained). One way to handle this is to decide in advance on a fixed schedule of stopping times $T$ (e.g.
$T_0, 2T_0, 2^2 T_0, \ldots$ for some $T_0$) and maintain suffix-averages only for those times. However, this is still
not very flexible. In contrast, maintaining the average of all iterates up to time $t$ can be done on-the-fly: we
initialize $\bar{\mathbf{w}}_1 = \mathbf{w}_1$, and for any $t > 1$, we let

$$\bar{\mathbf{w}}_t = \left( 1 - \frac{1}{t} \right) \bar{\mathbf{w}}_{t-1} + \frac{1}{t} \mathbf{w}_t. \qquad (9)$$

Unfortunately, returning the average of all iterates as in Eq. (9) is provably suboptimal and can harm
performance (Rakhlin et al. (2011)). Alternatively, we can easily maintain and return the current iterate $\mathbf{w}_t$,
but we only have a suboptimal $O(\log(t)/t)$ bound for it.

In the following, we analyze a new and very simple running average scheme, denoted as polynomial-decay
averaging, and show that it combines the best of both worlds: it can easily be computed on the fly, and it
gives an optimal rate. It is parameterized by a number $\eta \geq 0$, which should be thought of as a small constant
(e.g. $\eta = 3$), and the procedure is defined as follows: $\bar{\mathbf{w}}_1^{\eta} = \mathbf{w}_1$, and for any $t > 1$,

$$\bar{\mathbf{w}}_t^{\eta} = \left( 1 - \frac{\eta + 1}{t + \eta} \right) \bar{\mathbf{w}}_{t-1}^{\eta} + \frac{\eta + 1}{t + \eta} \mathbf{w}_t. \qquad (10)$$

For $\eta = 0$, this is exactly standard averaging (see Eq. (9)), whereas $\eta > 0$ reduces the weight of earlier
iterates compared to later ones. Moreover, $\bar{\mathbf{w}}_t^{\eta}$ can be computed on-the-fly, just as easily as computing a
standard average.
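
The recursion in Eq. (10) needs only the previous average and the round counter. The following sketch maintains it on the fly for iterates given as lists of floats; the class name, the helper `run_average` and the toy iterates are hypothetical and serve only as an illustration.

```python
class PolynomialDecayAverager:
    """Maintains the running average of Eq. (10) with decay parameter eta >= 0."""

    def __init__(self, eta=3.0):
        if eta < 0:
            raise ValueError("eta must be non-negative")
        self.eta = eta
        self.t = 0
        self.average = None

    def update(self, w):
        """Fold the next iterate w_t into the average and return the new average."""
        self.t += 1
        if self.t == 1:
            self.average = list(w)
        else:
            rho = (self.eta + 1.0) / (self.t + self.eta)  # weight on the newest iterate
            self.average = [(1.0 - rho) * a + rho * x for a, x in zip(self.average, w)]
        return self.average

def run_average(iterates, eta):
    """Feed all iterates through the averager and return the final average."""
    averager = PolynomialDecayAverager(eta)
    for w in iterates:
        averager.update(w)
    return averager.average

iterates = [[1.0], [2.0], [3.0], [4.0]]
plain = run_average(iterates, eta=0.0)     # standard average of all iterates
weighted = run_average(iterates, eta=3.0)  # later iterates weigh more
```

Memory use is a single vector regardless of how many rounds are run, and no stopping time has to be fixed in advance.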

We note that after this paper was accepted for publication, a similar averaging scheme was independently
proposed and studied in Lacoste-Julien et al. (2012). Compared to our method, they consider a slightly
different step-size and a specific choice of $\eta = 1$, using a more direct proof technique tailored to this case.
An analysis of our averaging scheme is provided in the theorem below.

Theorem 4. Suppose $F$ is $\lambda$-strongly convex, and that $\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq G^2$ for all $t$. Consider SGD initialized
with $\mathbf{w}_1$ and step-sizes $\eta_t = 1/\lambda t$. Also, let $\eta \geq 1$ be an integer. Then $\mathbb{E}[F(\bar{\mathbf{w}}_T^{\eta}) - F(\mathbf{w}^*)]$ is at most

$$58 \left( 1 + \frac{\eta}{T} \right) \left( \eta(\eta + 1) + \frac{(\eta + 0.5)^3 (1 + \log(T))}{T} \right) \frac{G^2}{\lambda T}.$$

The assumption that $\eta$ is an integer is merely for simplicity. Also, we made no effort to optimize the
constants, which can be easily improved for specific choices of $\eta$ (see the proof as well as the analysis in
Lacoste-Julien et al. (2012)).



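Proof. We can rewrite the recursion as

$$\bar{\mathbf{w}}_t^{\eta} = \frac{t - 1}{t + \eta} \bar{\mathbf{w}}_{t-1}^{\eta} + \frac{\eta + 1}{t + \eta} \mathbf{w}_t,$$

for $t \geq 1$ with $\bar{\mathbf{w}}_0^{\eta} = 0$. Unwrapping the recursion, we have that for any $T \geq 1$, $\bar{\mathbf{w}}_T^{\eta} = \sum_{t=1}^{T} \alpha_t \mathbf{w}_t$, where

$$\alpha_t = \frac{\eta + 1}{t + \eta} \prod_{j=t+1}^{T} \frac{j - 1}{j + \eta},$$

and at $t = T$, the convention that $\prod_{j=T+1}^{T} ((j - 1)/(j + \eta)) = 1$ is used.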
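
The closed form for the weights $\alpha_t$ can be checked numerically. The sketch below (the helper name is an assumption) computes them from the product formula; for $\eta = 0$ they reduce to the uniform weights $1/T$ of standard averaging.

```python
def polynomial_decay_weights(T, eta):
    """Return [alpha_1, ..., alpha_T], alpha_t = (eta+1)/(t+eta) * prod_{j=t+1}^{T} (j-1)/(j+eta)."""
    weights = []
    for t in range(1, T + 1):
        alpha = (eta + 1.0) / (t + eta)
        for j in range(t + 1, T + 1):  # empty product (equal to 1) when t == T
            alpha *= (j - 1.0) / (j + eta)
        weights.append(alpha)
    return weights

weights = polynomial_decay_weights(4, 3)  # T = 4, eta = 3
uniform = polynomial_decay_weights(5, 0)  # eta = 0 gives standard averaging
```

For $T = 4$ and $\eta = 3$ the weights are proportional to $1, 4, 10, 20$, so the last iterate alone carries more than half of the total weight.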

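We now denote $F'(\mathbf{w}) = F(\mathbf{w}) - F(\mathbf{w}^*)$. Since $\bar{\mathbf{w}}_T^{\eta}$ is a weighted average of $\mathbf{w}_1, \ldots, \mathbf{w}_T$, where the weights
$\alpha_t$ sum up to be 1, it follows by the convexity of $F$ and Jensen's inequality that $F'(\bar{\mathbf{w}}_T^{\eta}) \leq \sum_{t=1}^{T} \alpha_t F'(\mathbf{w}_t)$.

Let $S'_k = \sum_{t=T-k}^{T} F'(\mathbf{w}_t)$, and $\alpha_0 = 0$, then we have

$$F'(\bar{\mathbf{w}}_T^{\eta}) \leq \sum_{t=1}^{T} (\alpha_t - \alpha_{t-1}) S'_{T-t}. \qquad (11)$$

It is not difficult to check that for all $t \geq 1$:

$$\alpha_t - \alpha_{t-1} = \frac{\eta(\eta + 1) \prod_{j=t+1}^{T} (j - 1)}{\prod_{j=t}^{T+1} (j - 1 + \eta)} = \frac{\eta(\eta + 1)}{T(T + 1)} \prod_{j=t}^{T+1} \frac{j}{j - 1 + \eta} \leq \begin{cases} \frac{\eta(\eta + 1)}{T(T + 1)} \left( \frac{t - 2 + \eta}{T + \eta} \right)^{\eta - 1} & \text{if } t \leq T + 2 - \eta \\ \frac{\eta(\eta + 1)}{T(T + 1)} & \text{otherwise.} \end{cases} \qquad (12)$$

Let us suppose first that $\eta \geq 2$. In that case, we can upper bound the above by

$$\frac{\eta(\eta + 1)(t + \eta)}{T(T + 1)(T + 2)}.$$

As to $S'_{T-t}$ in Eq. (11), note that the upper bound proof of Thm. 3 equally applies to $\mathbb{E}[S'_{T-t}]/(T - t + 1)$. Using
this bound and substituting in Eq. (11), we obtain

$$\mathbb{E}[F'(\bar{\mathbf{w}}_T^{\eta})] \leq \sum_{t=1}^{T} (\alpha_t - \alpha_{t-1})(T - t + 1) \frac{17G^2 \log\left( \frac{Te}{\min\{t, T - t + 1\}} \right)}{\lambda T}$$
$$\leq \sum_{t=1}^{\lceil T/2 \rceil} \frac{2\eta(\eta + 1)(T + \eta)(t + \eta)}{T(T + 1)(T + 2)} \cdot \frac{17G^2 \log(Te/t)}{\lambda T}$$
$$= \frac{34G^2 \eta(\eta + 1)(T + \eta)}{\lambda T^2 (T + 1)(T + 2)} (A + B + C), \qquad (13)$$

where

$$A = \sum_{t=1}^{\lceil T/2 \rceil} \eta \log(Te/t) \leq \eta \lceil T/2 \rceil \log(Te),$$

$$B = \sum_{t=1}^{\lceil T/2 \rceil} t \log(Te) \leq 0.5 (\lceil T/2 \rceil)(\lceil T/2 \rceil + 1) \log(Te),$$

and

$$C = -\sum_{t=1}^{\lceil T/2 \rceil} t \log(t) \leq -\int_{1}^{\lceil T/2 \rceil} t \log(t)\,dt \leq -0.5 \lceil T/2 \rceil^2 \log(T/2) + 0.25 \lceil T/2 \rceil^2 - 0.25.$$

Therefore we have

$$A + B + C \leq (\eta + 0.5) \lceil T/2 \rceil \log(Te) + 0.5 \lceil T/2 \rceil^2 \log(2e^{1.5}) - 0.25 \leq (\eta + 0.5) \frac{T + 1}{2} \log(Te) + 0.5 (T + 1)(T + 2).$$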
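Plugging this estimate into Eq. (13) and simplifying, we obtain an upper bound on $\mathbb{E}[F(\bar{\mathbf{w}}_T^{\eta}) - F(\mathbf{w}^*)]$ of
the form

$$17 \left( 1 + \frac{\eta}{T} \right) \left( \eta(\eta + 1) + \frac{(\eta + 0.5)^3 (1 + \log(T))}{T} \right) \frac{G^2}{\lambda T}. \qquad (14)$$

It remains to treat the case $\eta = 1$. In that case, the upper bound on $\alpha_t - \alpha_{t-1}$ in Eq. (12) becomes

$$\alpha_t - \alpha_{t-1} \leq \frac{2}{T(T + 1)},$$

and using the same derivation as before, we get that

$$\mathbb{E}[F'(\bar{\mathbf{w}}_T^{\eta})] \leq \frac{68G^2}{\lambda T^2} \sum_{t=1}^{\lceil T/2 \rceil} \log(Te/t).$$

By an integration calculation, it is easy to verify that

$$\sum_{t=1}^{\lceil T/2 \rceil} \log(Te/t) \leq \lceil T/2 \rceil \log(Te) - \int_{1}^{\lceil T/2 \rceil} \log(t)\,dt = \lceil T/2 \rceil \log(Te) - [t \log(t) - t]_{1}^{\lceil T/2 \rceil} \leq \lceil T/2 \rceil (2 + \log(2)) - 1.$$

Plugging it in, we get an upper bound of

$$\frac{68G^2}{\lambda T} \cdot \frac{\lceil T/2 \rceil (2 + \log(2)) - 1}{T},$$

which for any $T \geq 1$ is at most $116G^2/\lambda T$. The stated result follows by combining this bound (for $\eta = 1$)
and Eq. (14) (for $\eta \geq 2$), increasing the numerical constant in Eq. (14) to obtain a uniform bound which
holds for all choices of $\eta$. □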

Note that for a constant $\eta$, the bound is essentially optimal. We end by noting that using an identical
proof technique, it holds in the case of general convex $F$ (with assumptions similar to Thm. 2) that

$$\mathbb{E}[F(\bar{\mathbf{w}}_T^{\eta}) - F(\mathbf{w}^*)] \leq O\left( \frac{\eta (D^2/c + cG^2)}{\sqrt{T}} \right).$$

This implies that polynomial-decay averaging is also optimal (up to constants) in the general convex case.



5 Experiments 



In this section, we study the behavior of the polynomial-decay averaging scheme on a few strongly-convex
optimization problems. We chose the same 3 binary classification datasets (CCAT, COV1 and ASTRO-PH) and
experimental setup as in Rakhlin et al. (2011). For each dataset $\{\mathbf{x}_i, y_i\}_{i=1}^{m}$, we ran SGD on the support
vector machine optimization problem

$$F(\mathbf{w}) = \frac{\lambda}{2} \|\mathbf{w}\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max\{0, 1 - y_i \langle \mathbf{x}_i, \mathbf{w} \rangle\},$$

with the domain $\mathcal{W} = \mathbb{R}^d$, where the stochastic gradient given $\mathbf{w}_t$ was computed by taking a single randomly
drawn training example $(\mathbf{x}_i, y_i)$ and computing the gradient with respect to that example, i.e.
$\hat{\mathbf{g}}_t = \lambda \mathbf{w}_t - \mathbf{1}_{y_i \langle \mathbf{x}_i, \mathbf{w}_t \rangle \leq 1}\, y_i \mathbf{x}_i$. All algorithms were initialized at $\mathbf{w}_1 = 0$. Following previous work, we chose $\lambda = 10^{-4}$ for
CCAT, $\lambda = 10^{-6}$ for COV1, and $\lambda = 5 \times 10^{-5}$ for ASTRO-PH. The $\eta$ parameter of polynomial-decay averaging
was set to 3. For comparison, besides polynomial-decay averaging, we also ran suffix averaging with $\alpha = 1/2$,
and simple averaging of all iterates. The results are reported in the figure below. Each graph is a log-log plot
representing the training error on one dataset over 10 repetitions, as a function of the number of iterations.
We also experimented on the test set provided with each dataset, but omit the results as they are very
similar.
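
The stochastic subgradient described above can be sketched as follows for dense feature vectors stored as Python lists. The helper names, the toy example and the use of the step size $1/(\lambda t)$ from Thm. 1 are illustrative assumptions; this is not the code used to produce the reported results.

```python
import random

def svm_stochastic_subgradient(w, x, y, lam):
    """lam * w - 1[y * <x, w> <= 1] * y * x for a single example (x, y) with y in {-1, +1}."""
    margin = y * sum(xi * wi for xi, wi in zip(x, w))
    active = 1.0 if margin <= 1.0 else 0.0  # hinge term contributes only on a margin violation
    return [lam * wi - active * y * xi for wi, xi in zip(w, x)]

def sgd_step(w, t, data, lam, rng):
    """One unconstrained SGD step on a randomly drawn example, step size 1 / (lam * t) (assumed)."""
    x, y = rng.choice(data)
    g_hat = svm_stochastic_subgradient(w, x, y, lam)
    eta_t = 1.0 / (lam * t)
    return [wi - eta_t * gi for wi, gi in zip(w, g_hat)]

# Toy checks: one example that violates the margin and one that satisfies it.
g_violated = svm_stochastic_subgradient([0.0, 0.0], [1.0, 2.0], 1, 0.1)
g_satisfied = svm_stochastic_subgradient([2.0, 0.0], [1.0, 0.0], 1, 0.1)
w_next = sgd_step([0.0, 0.0], 1, [([1.0, 2.0], 1)], 0.1, random.Random(0))
```

At a margin of exactly 1 the hinge loss has a kink, and either choice of the indicator there yields a valid subgradient.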

The graphs below clearly indicate that polynomial-decay averaging works quite well, achieving the best
or almost-best performance in all cases. Suffix averaging performs similarly, although as discussed
earlier, it is not as amenable to on-the-fly computation. Compared to these schemes, a simple average of all
iterates is significantly suboptimal, matching the results of Rakhlin et al. (2011).



6 Discussion 

In this paper, we investigated the convergence behavior of SGD, and the averaging schemes required to 
obtain optimal performance. In particular, we considered polynomial-decay averaging, which is as simple to 
compute as standard averaging of all iterates, but attains better performance theoretically and in practice. 
We also extended the existing analysis of SGD by providing new finite-sample bounds on individual SGD 
iterates, which hold without any smoothness assumptions, for both convex and strongly-convex problems. 
Finally, we provided new bounds for suffix averaging. While we focused on standard gradient descent, our 
techniques can be extended to the more general mirror descent framework and non-Euclidean norms. 

An important open question is whether the $O(\log(T)/T)$ rate we obtained on the individual iterate $\mathbf{w}_T$,
for strongly-convex problems, is tight. This question is important, because running SGD for $T$ iterations,
and returning the last iterate $\mathbf{w}_T$, is a very common heuristic. If the $O(\log(T)/T)$ bound is tight, it
means practitioners should not return the last iterate, since better $O(1/T)$ rates can be obtained by suffix
averaging or polynomial-decay averaging. Alternatively, an $O(1/T)$ bound on the last iterate can indicate
that returning the last iterate is indeed justified. For a further discussion of this, see Shamir (2012). Another
question is whether high-probability versions of our individual iterate bounds (Thm. 1 and Thm. 2) can be
obtained, especially in the strongly-convex case. Again, this question has practical implications, since if a
high-probability bound does not hold, it might imply that the last iterate can suffer from high variability,
and should be used with caution. Finally, the tightness of Thm. 2 is still unclear. In fact, even for the
simpler case of (non-stochastic) gradient descent, we do not know whether the behavior of the last iterate
proved in Thm. 2 is tight. In general, for an algorithm as simple and popular as SGD, we should have a
better understanding of how it behaves and how it should be used in an optimal way.
Acknowledgements: We thank Simon Lacoste-Julien for helpful comments. 